The Problem:

FORTRAN has been in widespread use for ~40 years in scientific
and industrial computation. On the order of 10^9 programmer-hours,
or 10^6 programmer-years, have been spent using this language to
create an enormously valuable legacy of code.

A major CFD (Computational Fluid Dynamics) code typically
represents 1 to 100 man-years of effort. As such it can be a
significant financial asset if it can be kept running across
generations of computers, or a major liability if it must be
re-written when new computers become available.

FORTRAN was designed to run efficiently on primitive serial
computers (at birth it had to compete head to head with machine
assembly code), and its widespread use influenced the development
of the most advanced machines used for numerical calculations in
the 60's, 70's and 80's (cf. CDC 6400, 6600, 7600, CRAY, etc.).

However, the performance that a single serial processor could
deliver economically soon became a limit to the steady growth of
computer performance, and alternative approaches (most typically
dedicated multiprocessor configurations) have been offered as
ways to surmount this constraint.

Initially, the performance gains of multi-processor execution
were accessible only through a programming language unique to
each machine. The cost of this constraint severely limited the
exploitation of multi-processor machines, and the runway for the
take-off of their widespread use is littered with the bankrupt
wrecks of such parallel computing companies as Cray Computers,
Connection Machines, Kendall Square, Hypercube, Floating Point
Systems, Myrias, the makers of the DAP machine, etc.

However, in the last 10 years, two independent computing
threads:

1. Inexpensive desktop scientific workstations (e.g. SUN)

2. Cheap personal computers (IBM PCs & their clones)

have created a large, successful commercial market selling very
high performance serial computing units very cheaply.

[A 75 MHz Pentium® PC can be purchased for under $1,500 today
and has a double precision FORTRAN performance of ~5 Megaflops
on real CFD code. The current scientific workstations cost
~$20,000 and have a performance in the 50 Megaflop range on real
code.]

The attraction of harnessing this affordable computing power in
multiple processor configurations has manifested itself in two
hardware developments:

1. 'Tightly bound' custom, dedicated Massively Parallel
Supercomputers using off-the-shelf scientific workstation CPU
chips:

   Chips        Machine
   DEC Alpha    Cray T3D/E
   SUN Sparc    Fujitsu AP1000
   Intel 860    Paragon

2. 'Loosely bound' clusters of off-the-shelf workstations
connected by:

   slow     Normal Ethernet   10 Mbits/sec.
   medium   Fast Ethernet     100 Mbits/sec.
   fast     FDDI              >200 Mbits/sec.


Meanwhile, Cray Research, Inc. (who invented the supercomputer
market and is still in business!) has produced a shared memory
multiple CPU computer as its top performing model, connecting
4-64 vector processing CPUs of 500+ Megaflops each to a large
(~1 gigaword) memory. Each processor can see all of the memory.

On the software side, the inefficiency of a parochial language for
each machine has been recognised as a problem. The proposed
solution is a 'common standard' of library subroutines to
manage the exchange of data between separate processors in a
uniform way from within FORTRAN. First PVM and more recently
MPI (Message Passing Interface) have been created by
enthusiastic committees of world experts. MPI is now available
on most major UNIX platforms (including PCs via LINUX) and
uses a standard set of FORTRAN subroutine calls.


Thus we now have the prospect of producing a single version of
a FORTRAN program that can run efficiently on serial, parallel or
massively parallel computer architectures.

In an ideal world, all major code would be re-written from the
ground up to exploit this portability. In practice, it is necessary
to port mature programs over to this new computing
environment.

I will now report on a successful case study of this porting
process and on what was learned from this effort that has wider
implications for the FORTRAN CFD community.
The Navy Layered Ocean Model

This program models oceans as a stack (3 ≤ N < 6) of shallow layers
of variable thickness: the shallow-water approximation.

Each layer i has a distinct density ρ_i (isopycnal).

Within each layer the spherical shallow water equations are solved for
the horizontal velocities (u_i, v_i), layer thickness h_i, temperature
T_i, salinity S_i and any tracer components σ_i (pollutants, oxygen
isotopes, experimental injectants, etc.).

The layers interact with each other through the pressure term, which
is a sum of the weight of the water column above each layer i at the
point (φ_i, θ_j) {longitude, latitude}, and through any parameterized
mass exchange between the layers.

The uppermost layer is driven by surface winds, rain and sunshine
from weather models; the deepest level experiences bottom drag.
Each layer exerts drag on the levels above it and below it.

This model reflects decades of physical observations of the oceans,
which reveal them to be composed of distinct and recognisable water
masses. The stacking into layers is caused by daily, seasonal and
long term climatological processes. This stacking is found to be very
stable except in the extreme North Atlantic in winter, when there are
episodic Deep Mixing events converting surface water to bottom
water a few tens of cubic miles at a time. There are compensating
upwelling events (El Nino, circum-Antarctic current) that occur
sporadically around the world in known locations. This means that
vertical mixing processes are unimportant over most of the world's
oceans.
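Here are the actual equations that are modelled in each layer:

2.5 Summary: A Single Layer with Uniform Density

2.5.1 Continuity

$$\frac{\partial h}{\partial t} + \frac{1}{a\cos\theta}\left[\frac{\partial U}{\partial\phi} + \frac{\partial (V\cos\theta)}{\partial\theta}\right] = \text{sources} - \text{sinks} \qquad (24)$$

2.5.2 Momentum

$$e_{\phi\phi} = \frac{\partial}{\partial\phi}\left(\frac{u}{\cos\theta}\right) - \cos\theta\,\frac{\partial}{\partial\theta}\left(\frac{v}{\cos\theta}\right) \qquad (25a)$$

$$e_{\phi\theta} = \frac{\partial}{\partial\phi}\left(\frac{v}{\cos\theta}\right) + \cos\theta\,\frac{\partial}{\partial\theta}\left(\frac{u}{\cos\theta}\right) \qquad (25b)$$

The velocities (u, v) are defined from

$$hu = U \qquad (26a)$$

$$hv = V \qquad (26b)$$

φ component:

$$\frac{\partial U}{\partial t} + \frac{1}{a\cos\theta}\left[\frac{\partial (Uu)}{\partial\phi} + \frac{\partial (Vu\cos\theta)}{\partial\theta} - Vu\sin\theta - a\Omega V\sin 2\theta\right] = -\frac{hg}{a\cos\theta}\frac{\partial h}{\partial\phi} + hF_\phi + \frac{A_H}{a^2\cos^2\theta}\left[\frac{\partial (h\cos\theta\,e_{\phi\phi})}{\partial\phi} + \frac{\partial (h\cos^2\theta\,e_{\phi\theta})}{\partial\theta}\right] \qquad (27)$$

θ component:

$$\frac{\partial V}{\partial t} + \frac{1}{a\cos\theta}\left[\frac{\partial (Uv)}{\partial\phi} + \frac{\partial (Vv\cos\theta)}{\partial\theta} + Uu\sin\theta + a\Omega U\sin 2\theta\right] = -\frac{hg}{a}\frac{\partial h}{\partial\theta} + hF_\theta + \frac{A_H}{a^2\cos^2\theta}\left[\frac{\partial (h\cos\theta\,e_{\phi\theta})}{\partial\phi} - \frac{\partial (h\cos^2\theta\,e_{\phi\phi})}{\partial\theta}\right] \qquad (28)$$

2.5.3 Scalar transport

$$\frac{\partial (hT)}{\partial t} + \frac{1}{a\cos\theta}\left[\frac{\partial (UT)}{\partial\phi} + \frac{\partial (VT\cos\theta)}{\partial\theta}\right] = \frac{K_H}{a^2\cos^2\theta}\left[\frac{\partial}{\partial\phi}\left(h\frac{\partial T}{\partial\phi}\right) + \cos\theta\,\frac{\partial}{\partial\theta}\left(h\cos\theta\,\frac{\partial T}{\partial\theta}\right)\right] + \text{sources} - \text{sinks} \qquad (29)$$

Here a is the Earth's radius, Ω its rotation rate and g the acceleration
due to gravity; A_H and K_H denote the horizontal eddy viscosity and
diffusivity coefficients.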
This approach is in contrast to fixed-level codes, where the ocean is
divided vertically into j fixed levels at depths z_j, 10 ≤ j < 50.

An ocean basin, such as the Pacific, is covered with a uniform grid or
mesh, and finite difference equations approximating the partial
differential equations set out for each layer are solved at each time
step to advance the field variables in time.

The initial conditions are supplied from established climatological
atlases, and the ocean basins are spun up for several years or
decades to establish sensible motions, temperatures and salinity.

This model has been in research use since 1978, when its original
authors published a study of the Loop Current in the Gulf of Mexico.

A recent result from the model featured on the cover of Nature last
summer (4 August 1994) and, for the rest of that year and this year,
in their CD-ROM promotional literature.

This model is classic supercomputer legacy code, having started life
on a Texas Instruments Advanced Scientific Computer and evolved
through the CRAY 1-S, CRAY X-MP, CYBER 203 and 205, and ETA 10; it is
now running on an 8-processor CRAY C90.

In 1992 the US Department of Defense started a High Performance
Computer Modernisation Program. This program is buying a
state-of-the-art computer for about $25,000,000 every year and
establishing a Major Shared Resource Center to house and maintain
this facility. These HPC centers will be the main resource for DoD
sponsored computing for the foreseeable future.

There is very strong pressure to move all major DoD based
computing onto these new machines, typically of MPP architecture.
Technical problems:

Although it is possible to solve the finite difference approximations to
the shallow water equations explicitly (which lends itself immediately
to a tiled approach on MPP architecture), the need to resolve the
surface gravity wave mode (tidal waves) limits the time step to
values that are too small (to ensure numerical stability) to be
economical.
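The size of this restriction can be estimated with a rough sketch. The function names, the 4000 m depth, the 60 degree latitude and the sqrt(2) factor of a generic 2-D explicit scheme are illustrative assumptions, not values taken from the model.

```python
import math

G = 9.81                 # gravitational acceleration, m/s^2
EARTH_RADIUS = 6.371e6   # metres

def gravity_wave_speed(depth_m):
    """Phase speed of long surface gravity waves, c = sqrt(g H)."""
    return math.sqrt(G * depth_m)

def max_explicit_timestep(nx, depth_m, latitude_deg=0.0):
    """CFL-style bound dt <= dx / (sqrt(2) c) for a generic 2-D explicit scheme.

    dx is the east-west spacing of nx points around one latitude circle.
    """
    circumference = 2.0 * math.pi * EARTH_RADIUS * math.cos(math.radians(latitude_deg))
    dx = circumference / nx
    return dx / (math.sqrt(2.0) * gravity_wave_speed(depth_m))

# Assumed 4000 m deep ocean at 60 degrees latitude, for the three grid sizes.
for nx in (1024, 2048, 4096):
    print(nx, round(max_explicit_timestep(nx, 4000.0, 60.0), 1), "s")
```

Under these assumptions the permitted step is of the order of a minute or two and halves with every doubling of Nx, which is what makes the explicit approach uneconomical.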

The semi-implicit approach filters out these surface waves at the cost
of having to solve a Helmholtz equation for each layer thickness
variable, h_i.

It is necessary to solve this equation:

$$\nabla^2 h_i' - c_i^2\,h_i' = S_i'(\phi,\theta) \qquad (1)$$

over the true ocean basin shape and depth distribution.

This is the classic fractal problem. Here h_i' is a linear combination
of the original h_i variables that allows a decomposition of the
pressure driving terms in the velocity advection equations into
separate modes. In general, c_1 represents the whole ocean mode,
while c_2, c_3, ... represent various internal modes of motion
between layers.

The oceanographers call the c_1 mode barotropic and the c_2,
c_3, ... modes baroclinic.

For c_2, c_3, ..., equation (1) can be solved to sufficient accuracy in
a few SOR relaxations, a technique easily moved to MPP architecture,
but for the c_1 case SOR converges much too slowly and an exact
solver must be used.
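The contrast between the two cases can be illustrated with a small red-black SOR sketch for a discrete Helmholtz equation on a square grid with unit spacing and zero boundary values. The grid size, relaxation factor and the two coefficient values are assumptions chosen for illustration; the small coefficient stands in for the slowly converging case.

```python
def sor_helmholtz(src, coef, omega=1.25, sweeps=40):
    """Red-black SOR for (Laplacian - coef) h = src with h = 0 on the edge."""
    ny, nx = len(src), len(src[0])
    h = [[0.0] * nx for _ in range(ny)]
    diag = 4.0 + coef
    for _ in range(sweeps):
        for colour in (0, 1):              # update one colour, then the other
            for j in range(1, ny - 1):
                for i in range(1, nx - 1):
                    if (i + j) % 2 != colour:
                        continue
                    nb = h[j][i - 1] + h[j][i + 1] + h[j - 1][i] + h[j + 1][i]
                    gauss_seidel = (nb - src[j][i]) / diag
                    h[j][i] += omega * (gauss_seidel - h[j][i])
    return h

def max_residual(h, src, coef):
    """Largest |Laplacian(h) - coef*h - src| over the interior points."""
    ny, nx = len(src), len(src[0])
    worst = 0.0
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            lap = h[j][i - 1] + h[j][i + 1] + h[j - 1][i] + h[j + 1][i] - 4.0 * h[j][i]
            worst = max(worst, abs(lap - coef * h[j][i] - src[j][i]))
    return worst

n = 16
src = [[1.0] * n for _ in range(n)]
fast = max_residual(sor_helmholtz(src, 1.0), src, 1.0)     # large coefficient
slow = max_residual(sor_helmholtz(src, 1e-4), src, 1e-4)   # nearly a Poisson problem
print(fast, slow)
```

With the same number of sweeps the residual for the small coefficient remains many orders of magnitude larger, and the gap widens as the grid grows.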
The serial computer code used the Capacitance Matrix Technique
(CMT) of Hockney (1970). This approach can be set out schematically
as follows:

1. Given S_1'(φ,θ), find h_1'', a solution field from an exact solver
for a perfect rectangular domain.

2. Find an error field e(φ,θ) = h_1''(φ,θ) - h_1'(φ,θ) at N
special boundary points.

3. Calculate a correction field q' = [CMT] e, where [CMT] is
an N by N matrix.

4. Exact solve again with the source corrected by q' on the perfect
domain to find a solution field with no error at the boundaries.

This CMT technique lends itself well to MPP architecture, as the error
calculation is a local process and the matrix multiply can be
distributed efficiently over P processors.
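The four steps can be followed on a toy problem. The sketch below is a 1-D analogue under stated assumptions: the "perfect domain" is a line of points solved exactly with the Thomas algorithm, the special points are interior points where the field is required to vanish, and the small linear system is solved directly instead of storing the inverse matrix. All names are hypothetical.

```python
def solve_perfect(rhs, c):
    """Exact solver on the perfect domain: h[k-1] - (2+c) h[k] + h[k+1] = rhs[k]."""
    n, diag = len(rhs), -(2.0 + c)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = 1.0 / diag, rhs[0] / diag
    for k in range(1, n):
        m = diag - cp[k - 1]
        cp[k], dp[k] = 1.0 / m, (rhs[k] - dp[k - 1]) / m
    h = [0.0] * n
    h[-1] = dp[-1]
    for k in range(n - 2, -1, -1):
        h[k] = dp[k] - cp[k] * h[k + 1]
    return h

def dense_solve(a, b):
    """Gaussian elimination with partial pivoting for the small N-by-N system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def capacitance_matrix(n, c, special):
    """Response at every special point to a unit source at each special point."""
    cap = [[0.0] * len(special) for _ in special]
    for j, bj in enumerate(special):
        unit = [0.0] * n
        unit[bj] = 1.0
        g = solve_perfect(unit, c)
        for i, bi in enumerate(special):
            cap[i][j] = g[bi]
    return cap

def cmt_solve(src, c, special, cap):
    h2 = solve_perfect(src, c)                 # step 1: exact solve, perfect domain
    err = [h2[b] for b in special]             # step 2: error at the special points
    q = dense_solve(cap, err)                  # step 3: correction sources
    corrected = src[:]
    for b, qb in zip(special, q):
        corrected[b] -= qb
    return solve_perfect(corrected, c)         # step 4: exact solve, corrected source

special = [10, 20]
cap = capacitance_matrix(32, 0.01, special)
h = cmt_solve([1.0] * 32, 0.01, special, cap)
print(h[10], h[20])
```

The capacitance matrix depends only on the geometry, so in a time-stepping model it would be built once and reused at every step.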

The key is the exact solver!

Here the serial code used the Fourier Analysis - Cyclic Reduction
technique of Temperton (1977).

This method uses a Discrete Fourier Transform in one space direction,
followed by a Tridiagonal equation solve for each Fourier mode in the
orthogonal direction, finished off by a Discrete Fourier Inverse to
recover the physical variable values from their Fourier mode
representation.
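A minimal sketch of this three-stage pattern follows, for a rectangular grid with unit spacing and zero values outside the grid. It is not the production routine: it uses a naive O(n^2) sine transform in place of an FFT, omits any cyclic reduction, and all names are hypothetical.

```python
import math

def sine_transform(row):
    """Naive discrete sine transform (DST-I) of one grid row."""
    n = len(row)
    return [sum(row[i] * math.sin(math.pi * (m + 1) * (i + 1) / (n + 1))
                for i in range(n)) for m in range(n)]

def thomas(diag, rhs):
    """Solve x[j-1] + diag*x[j] + x[j+1] = rhs[j] with zero end values."""
    n = len(rhs)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = 1.0 / diag, rhs[0] / diag
    for j in range(1, n):
        m = diag - cp[j - 1]
        cp[j], dp[j] = 1.0 / m, (rhs[j] - dp[j - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for j in range(n - 2, -1, -1):
        x[j] = dp[j] - cp[j] * x[j + 1]
    return x

def fourier_tridiagonal_solve(src, coef):
    """Solve (5-point Laplacian - coef) h = src on an ny-by-nx grid."""
    ny, nx = len(src), len(src[0])
    s_hat = [sine_transform(row) for row in src]        # Fourier analysis in x
    h_hat = [[0.0] * nx for _ in range(ny)]
    for m in range(nx):                                 # one tridiagonal solve per mode
        lam = 2.0 * math.cos(math.pi * (m + 1) / (nx + 1)) - 2.0
        column = thomas(lam - 2.0 - coef, [s_hat[j][m] for j in range(ny)])
        for j in range(ny):
            h_hat[j][m] = column[j]
    scale = 2.0 / (nx + 1)                              # DST-I inverts itself up to this factor
    return [[scale * v for v in sine_transform(row)] for row in h_hat]

src = [[float((7 * i + 3 * j) % 5) for i in range(8)] for j in range(6)]
h = fourier_tridiagonal_solve(src, 0.5)
```

The transforms act along rows and the tridiagonal solves along columns, which is why a parallel layout that suits one stage tends to be awkward for the other.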
The efficiency of this method is so marked that it is in widespread
use, but neither Fourier Analysis nor Tridiagonal equation solving
lives well in a parallel execution environment.

Our approach is to:

1. Adopt a 1-D mapping of long thin tiles to allow the FFTs to
take place entirely within each tile.

2. Adopt a block-pipeline tridiagonal solver using the Burn at
Both Ends (BABE) algorithm. This keeps only a small number
of processors idle at the start and the finish of the
tridiagonal solving cycle.

This strategy gave promise of being moderately efficient in a coarse
MPP environment (4 < P ≤ 256).
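The idea behind burning at both ends can be shown with a serial sketch of the underlying elimination. It works inwards from the first and last rows at once and meets in the middle. The block-pipelined, multi-processor form used in the model is not reproduced here, and the function name is hypothetical.

```python
def babe_solve(a, b, c, d):
    """Two-ended elimination for a[k]*x[k-1] + b[k]*x[k] + c[k]*x[k+1] = d[k].

    a[0] and c[-1] are ignored. The two sweeps touch disjoint rows, so they
    could run at the same time on different processors.
    """
    n = len(d)
    mid = n // 2
    b, d = b[:], d[:]
    # Downward sweep from the top: remove the sub-diagonal in rows 1..mid.
    for k in range(1, mid + 1):
        f = a[k] / b[k - 1]
        b[k] -= f * c[k - 1]
        d[k] -= f * d[k - 1]
    # Upward sweep from the bottom: remove the super-diagonal in rows n-2..mid.
    for k in range(n - 2, mid - 1, -1):
        f = c[k] / b[k + 1]
        b[k] -= f * a[k + 1]
        d[k] -= f * d[k + 1]
    x = [0.0] * n
    x[mid] = d[mid] / b[mid]                    # the two fronts meet here
    for k in range(mid - 1, -1, -1):            # substitute back towards the top
        x[k] = (d[k] - c[k] * x[k + 1]) / b[k]
    for k in range(mid + 1, n):                 # and towards the bottom
        x[k] = (d[k] - a[k] * x[k - 1]) / b[k]
    return x

x_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
n = len(x_true)
a, b, c = [1.0] * n, [-4.0] * n, [1.0] * n
d = [(a[k] * x_true[k - 1] if k > 0 else 0.0) + b[k] * x_true[k]
     + (c[k] * x_true[k + 1] if k < n - 1 else 0.0) for k in range(n)]
x = babe_solve(a, b, c, d)
```

Because each sweep covers only half of the rows, the serial dependency chain is roughly halved compared with a one-directional elimination.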

Overall, the bulk of the ocean model calculation is done on a 2-D
tiling covering the domain. Only for the barotropic mode is the
S_1'(φ,θ) field mapped from its 2-D form to a 1-D form. This allows a
simple SPMD (Single Program Multiple Data) approach to the rest of
the code. Exchange of Halo Data is moved to a standard subroutine
which is made architecture or communication protocol specific.

Only the Barotropic Helmholtz Solver requires specific communication
modes beyond the standard 2-D tile Halo update.

We can analyse the cost of this approach within an MPP environment.

Summarising the FACR(0) algorithm as we have implemented it:

1. Total Operation Count = 8 log2(Ny) Ny Nx / P (to leading order)

2. Total Data Movement = 2(P-1) Nx

Values for Nx are 1024, 2048 or 4096. These give global resolutions of
1/4°, 1/8° and 1/16° respectively.

On a global grid where Δφ = Δθ, Ny = Nx/2.

If we adopt a naive model of data communication, characterising the
time to move one value as a constant L times the time taken for a
floating point operation, we can estimate the degree to which this
approach scales over increasing numbers of processors P.

Let us characterise the performance of the algorithm as

Performance(P) = time(P=1) / time(P)

Here is a plot of this function for a value of 4 chosen for L.
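The performance function is simple enough to evaluate directly. The sketch below reads the operation count as 8 log2(Ny) Ny Nx / P and the data movement as 2(P-1) Nx, and adds them with the weight L; the function names and this particular reading of the formulas are assumptions.

```python
import math

def facr_time(p, nx, ny, comm_cost):
    """Model time in flop units: shared arithmetic plus L-weighted data movement."""
    flops = 8.0 * math.log2(ny) * ny * nx / p
    moved = 2.0 * (p - 1) * nx
    return flops + comm_cost * moved

def performance(p, nx, ny, comm_cost):
    """Performance(P) = time(P=1) / time(P)."""
    return facr_time(1, nx, ny, comm_cost) / facr_time(p, nx, ny, comm_cost)

# 1024 x 768 grid with L = 4: speed-up and the gain from each doubling of P.
for p in (8, 16, 32, 64):
    gain = facr_time(p // 2, 1024, 768, 4.0) / facr_time(p, 1024, 768, 4.0)
    print(p, round(performance(p, 1024, 768, 4.0), 1), round(gain, 2))
```

In this model the gain from each doubling of P falls steadily below 2 as the communication term, which grows with P, catches up with the shrinking arithmetic term.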
This corresponds well to the Cray T3D timings we have obtained for
the 1024 x 768 case:

   P     time     ratio
   8     0.1298
   16    0.0666   1.95
   32    0.0369   1.80
   64    0.0227   1.63

The overall results suggest that this approach will allow FACR(0)
to run easily and efficiently for 32 < P ≤ 256, which should match well
the P values proposed for most general purpose MPP machines.

The efficiency of the complete CMT algorithm depends both upon the
efficiency of the FACR(0) solver and on the number of special
boundary points (B) that define the coastline of the ocean basin.

As a coastline is the archetypal fractal length, B would be expected
to scale with Nx in a non-linear fashion. Empirical results from the
standard bathymetric data base suggest that B ≈ (1/3) Nx^(3/2). In
practice, the coastline is smoothed to limit the size of the CMT matrix.


For the current set of World Ocean models in use, we use:

   N       B (actual)   B (fit)   Size of CMT (bytes)
   512     3929         3861      62M
   1024    8915         10922     318M
   2048    18936        30843     1.4G
   4096    ???          87381     30.5G
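The table entries can be checked with a few lines of arithmetic. The sketch assumes the empirical fit is B ≈ N^(3/2)/3 truncated to an integer, and that the dense B by B matrix is stored at 4 bytes per value; both assumptions reproduce most of the figures above, but neither is stated explicitly in the source.

```python
def estimated_boundary_points(n):
    """Empirical fit B ~ N**1.5 / 3, truncated to an integer."""
    return int(n ** 1.5 / 3.0)

def cmt_bytes(b, bytes_per_value=4):
    """Storage for a dense B-by-B capacitance matrix."""
    return b * b * bytes_per_value

for n, b_actual in ((512, 3929), (1024, 8915), (2048, 18936)):
    print(n, estimated_boundary_points(n), round(cmt_bytes(b_actual) / 1e6), "MB")
print(4096, estimated_boundary_points(4096),
      round(cmt_bytes(estimated_boundary_points(4096)) / 1e9, 1), "GB")
```

Since storage grows as B squared, and B itself grows faster than N, each doubling of resolution multiplies the matrix size by roughly eight under this fit.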
For the complete CMT Algorithm (including the 2-D to 1-D and back
mapping costs) we find:

Floating point ops per processor = (11B + 2B^2)/P + 2 FACR(0) costs

Data movement = (1 - 1/P)(Nx^2 + 2B) + 2 FACR(0) costs

Here are the performance curves for an L value of 4 for the 3 cases
where an actual value of B is set (there is strong pressure to smooth
the boundary data to reduce the size of the capacitance matrix).
Programming Style:

To achieve portability between distributed memory systems
(Workstation Clusters) and shared memory systems (CRAY C90),
a programming style was adopted which allowed for both models as
well as for testing on a single serial machine on one tile.

The domain was divided up into Mproc by Nproc tiles such that:

Mproc x Nproc = P

Mproc and Nproc could both be 1.

A double set of outer loops over all the 2-D tiles was introduced
around each computation statement, e.g. for a Red-Black SOR
type loop:

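      COMMON/PROC2D/MPROC,NPROC
      DIMENSION R(NX,NY,MPROC,NPROC),
     +          P(NX,NY,MPROC,NPROC)
C     Outer loops run over the 2-D tiles, inner loops over one tile.
      DO JPROC=1,NPROC
        DO IPROC=1,MPROC
          DO J=2,NY-1
C           Stride 2 visits one colour of the red-black ordering.
            DO I=2+MOD(J,2),NX-2+MOD(J,2),2
              P(I,J,IPROC,JPROC) = ...
            ENDDO
          ENDDO
        ENDDO
      ENDDO
C     Refresh the halo cells shared between neighbouring tiles.
      CALL PASS_HALO(P)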
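What a halo update of this kind has to do can be sketched for the shared memory case. The layout below (every tile stored with a one-cell halo on each side, tiles indexed as tiles[jproc][iproc]) and the function body are assumptions made for illustration; the actual routine is not shown in the source.

```python
def pass_halo(tiles):
    """Copy each tile's interior edge values into its neighbours' halo cells.

    tiles[jp][ip][j][i]: every tile is ny-by-nx including a one-cell halo.
    """
    nproc, mproc = len(tiles), len(tiles[0])
    ny, nx = len(tiles[0][0]), len(tiles[0][0][0])
    for jp in range(nproc):
        for ip in range(mproc):
            t = tiles[jp][ip]
            if ip + 1 < mproc:                      # east-west neighbour pair
                east = tiles[jp][ip + 1]
                for j in range(ny):
                    east[j][0] = t[j][nx - 2]
                    t[j][nx - 1] = east[j][1]
            if jp + 1 < nproc:                      # north-south neighbour pair
                north = tiles[jp + 1][ip]
                for i in range(nx):
                    north[0][i] = t[ny - 2][i]
                    t[ny - 1][i] = north[1][i]
    return tiles

# Two 4x4 tiles side by side, filled with 1.0 and 2.0 respectively.
left = [[1.0] * 4 for _ in range(4)]
right = [[2.0] * 4 for _ in range(4)]
pass_halo([[left, right]])
print(left[1], right[1])
```

Keeping all of this inside one routine means that a message passing version only has to replace the copies with sends and receives, leaving the computation loops untouched. Corner cells are not treated specially here, which is adequate for a 5-point stencil.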
The outer loop limits MPROC, NPROC are set at compile time where
possible to ensure efficient optimization. In advanced FORTRAN or
in FORTRAN 90/95 the two outer loop commands would be replaced
by FORALL statements to indicate that simultaneous execution of the
outer loops is possible (and encouraged!).

[Table: typical MPROC, NPROC values used on the CRAY T3D and the
CRAY C90, for shared memory and distributed memory runs; values not
recoverable.]

MPROC and NPROC can be set at compile time or at run time.


The Ocean Model code is now designed to port quickly to a variety of
MPP ensembles. The default data communication protocol is MPI.
Only one processor (#0) is allowed to do external I/O. The code has
been tested on a cluster of SUN 10 workstations linked by ethernet
running MPICH. Because the inter-tile communication is localized to
a single generic Halo Update subroutine for the 2-D tile portion of the
program (the bulk!), and to a few specialised communications in the
CMT/FACR(0) solver, tailoring the code to a parochial data
communication protocol (e.g. CRAY T3D in shared memory mode) is
straightforward and can be implemented only where critical.

The prospect of running the code on a cluster of high performance
workstations is attractive. However, investigations of the
CMT/FACR(0) algorithm when L is increased from 4 to the much
higher values expected in a loosely bound cluster suggest that the
scaling potential is much reduced.


Modernizing  Legacy  CFD  Code: 
The  US  Navy  Layered  Ocean  Model 


23/10/95 
Coimbra  ‘95 


Here are efficiency curves for the case:

Nx = 1024, Ny = 576, B = 8915 and L = 4, 10, 40, 100.
Experience:

Dr. Wallcraft and I found it easiest to proceed in two steps:

1. Produce a shared memory version of the code using 2-D tiling
(1-D tiling in the Barotropic Solver), isolating the inter-tile
communication steps. This model will run on the CRAY C90 in
multi-tasking mode.

2. Produce an MPI version of the code by writing MPI subroutine calls
to implement the inter-tile communication steps.

Step 1 ensured that the calculations on each tile could proceed
independently and localised the communication.

Step 2 confined the MPI based code to a very few easily tagged
subroutines that could be replaced by more efficient communication
protocols when appropriate.

Debugging was possible on a small cluster (P=4) of SUN workstations
running MPI. It was then possible to move the code directly onto a
large P (64!) CRAY T3D with no further modification.

Efficient local serial codes can enhance the overall performance
considerably, and the advantages of being able to exploit them
may outweigh the cost of reshuffling the data from a 2-D tile form to
a 1-D tile form and back again (the FFT routine!).
Conclusions for 2-D CFD Codes

1. Producing modern MPP environment code from mature production
serial CFD code need not be an impossible task.

2. Moving in stages from SPSD to SPMD to MPMD seems easier and
less error prone than a single SPSD to MPMD jump.

3. Some classical serial algorithms will survive the move to coarse
MPP environments (P<100) that are tightly bound ( < ).

4. In large P environments, the size of L is critical.

5. Moderate sized workstation clusters (P~8) can be efficiently
exploited to run these codes 5 times faster than a single
workstation.

6. Artificial data structures to exploit local efficient serial code may
be justified, especially for dedicated 'tightly bound' machines.

7. There may yet be life left in some 'old dogs' [efficient serial
algorithms] such as Hockney's Method in an MPP environment
with P ~ 100.
Dr Wallcraft and I feel that we have been able to port a large mature
program to a generic MPP environment with a simple standard of
programming style (outer loops over tiles) and a default choice of
inter-processor communication (MPI). This creates a code that can
be readily moved onto a new High Performance Computer, so that
production type runs can be started with a minimum of porting effort.
Later customisation to exploit faster local communication protocols
can be undertaken once MPI based experience has been gained that
highlights specific data bottlenecks. It is hoped that this leaves the
NRL Ocean Modelling group well positioned to exploit the
opportunities opening up from the DoD HPC initiative.